Spanish patent ES2787894T9: Method and device for detecting the audio signal

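Patent abstract:

Publication number: ES2787894T9
Application number: ES14885786T
Filing date: 2014-12-01
Publication date: 2021-12-28
Inventor: Zhe Wang
Applicant: Huawei Technologies Co Ltd
Main IPC class:

Patent description: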

[0002] Method and apparatus for detecting audio signals
[0004] Technical field
[0006] The present invention relates to the field of signal processing technologies, and more specifically, to a method for detecting an audio signal and an apparatus.
[0008] Background
[0010] Voice Activity Detection (VAD) is a key technology widely used in fields such as voice communications and human-machine interaction. VAD may also be referred to as sound activity detection (SAD). The VAD is used to detect whether there is an active signal in an input audio signal, where the active signal is relative to an inactive signal (such as background ambient noise and a muted voice). Typical active signals include a voice, music, and the like. A principle of VAD is that one or more feature parameters are extracted from an input audio signal, one or more feature values are determined based on one or more feature parameters, and then one or more feature values are compared with one or more thresholds.
[0012] In the prior art, an active signal detection method based on a segmental signal-to-noise ratio (SSNR) includes: dividing an input audio signal into multiple subband signals in the frequency domain, calculating the energy of the audio signal in each subband, and comparing the energy of the audio signal in each subband with the estimated energy of a background noise signal in that subband, to obtain a signal-to-noise ratio (SNR) of the audio signal in each subband; and then determining an SSNR from the subband SNRs, and comparing the SSNR with a predetermined VAD decision threshold, where if the SSNR exceeds the VAD decision threshold, the audio signal is an active signal, and if the SSNR does not exceed the VAD decision threshold, the audio signal is an inactive signal.
[0014] A typical method for calculating the SSNR is to add up all the subband SNRs of the audio signal; the result is the SSNR. For example, the SSNR can be determined using formula 1.1:
[0016] $$\mathrm{SSNR} = \sum_{k=0}^{N-1} \mathrm{snr}(k) \qquad \text{(Formula 1.1)}$$
[0017] where k indicates the kth subband, snr(k) indicates the subband SNR of the kth subband, and N indicates the total number of subbands into which the audio signal is divided.
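The conventional scheme described above can be sketched in Python as follows. This is only a minimal illustration: the function names, the sample energies and the threshold value are hypothetical and are not taken from the patent, and direct (linear) energy-ratio subband SNRs are assumed.

```python
from typing import List, Sequence


def subband_snrs(energies: Sequence[float], noise_energies: Sequence[float]) -> List[float]:
    """Direct energy-ratio SNR per subband: signal energy / estimated noise energy."""
    if len(energies) != len(noise_energies):
        raise ValueError("one noise estimate is required per subband")
    return [e / n for e, n in zip(energies, noise_energies)]


def reference_ssnr(snrs: Sequence[float]) -> float:
    """Formula 1.1: unweighted sum of the N subband SNRs."""
    return sum(snrs)


def is_active(ssnr: float, vad_threshold: float) -> bool:
    """Active if the SSNR exceeds the VAD decision threshold, otherwise inactive."""
    return ssnr > vad_threshold


# Hypothetical 4-subband frame: signal energies and estimated noise energies.
snrs = subband_snrs([8.0, 6.0, 2.0, 1.0], [2.0, 2.0, 2.0, 2.0])
ssnr = reference_ssnr(snrs)
print(snrs, ssnr, is_active(ssnr, vad_threshold=8.0))
```

A real detector would obtain the noise energies from a background noise estimator and would use many more subbands; both are outside the scope of this sketch.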
[0019] When the above method for calculating the SSNR is used to detect an active voice, an erroneous detection of an active voice may occur.
[0021] US2013/191117A1 describes that in speech processing systems, compensation is made for sudden changes in background noise in calculating the average signal-to-noise ratio (SNR). SNR outlier filtering can be used, alone or in conjunction with average SNR weighting. Adaptive weights can be applied to the per-band SNRs before calculating the average SNR. The weighting function may be a function of noise level, noise type, and/or instantaneous SNR value.
[0023] US 2013/304464A1 describes a method and apparatus for adaptively detecting speech activity in an input audio signal composed of frames. The method comprises the steps of: determining a noise characteristic of the input signal based on a received frame of the input audio signal; deriving a voice activity detection (VAD) parameter based on the noise characteristic of the input audio signal; and comparing the derived VAD parameter to the threshold value to provide a voice activity detection decision.
[0025] Weiwu Jiang et al., "A new voice activity detection method using maximized sub-band SNR", describes a VAD method that uses the maximum value of the subband SNR (MVSS) as the detection feature. The proposed MVSS feature has different distributions for voice and non-voice signals, which helps to separate a voice signal from loud noise. An adaptive threshold is applied to improve VAD accuracy and to track the noisy signal quickly without complex calculations.
[0027] Summary
[0029] The present invention provides a method for detecting an audio signal and an apparatus, which can accurately distinguish between an active voice and an inactive voice.
[0031] The invention is defined in the appended claims. In the following, where the word "embodiment(s)" refers to combinations of features other than those defined by the claims, it relates to examples that were originally filed but do not represent embodiments of the presently claimed invention; these examples are shown for illustrative purposes only.
[0032] According to a first aspect, the present invention provides a method for detecting an audio signal, wherein the method includes: determining an input audio signal as an audio signal to be determined if the audio signal is determined to be an unvoiced signal; determining an enhanced segmental signal-to-noise ratio (SSNR) of the audio signal, where the enhanced SSNR is greater than a reference SSNR; and comparing the enhanced SSNR with a voice activity detection (VAD) decision threshold to determine whether the audio signal is an active signal, wherein determining the enhanced SSNR of the audio signal comprises: determining the reference SSNR of the audio signal; and determining the enhanced SSNR according to the reference SSNR of the audio signal.
[0033] With reference to the first aspect, in a first possible implementation of the first aspect, determining the enhanced SSNR according to the reference SSNR of the audio signal includes: determining the enhanced SSNR using the following formula: SSNR' = x * SSNR + y, where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and x and y indicate enhancement parameters.
[0035] According to a second aspect, the present invention provides a method for detecting an audio signal, wherein the method includes: determining an input audio signal as an audio signal to be determined if the audio signal is determined to be an unvoiced signal; determining a weight for the subband SNR of each subband in the audio signal, where the weight of the subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than the weight of the subband SNR of any other subband; determining an enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, where the enhanced SSNR is greater than a reference SSNR; and comparing the enhanced SSNR with a VAD decision threshold to determine whether the audio signal is an active signal.
[0037] According to a third aspect, the present invention provides an apparatus, wherein the apparatus includes: a first determining unit, configured to determine an input audio signal as an audio signal to be determined if it is determined that the audio signal is an unvoiced signal; a second determining unit, configured to determine an enhanced SSNR of the audio signal, where the enhanced SSNR is greater than a reference SSNR; and a third determining unit, configured to compare the enhanced SSNR with a VAD decision threshold to determine whether the audio signal is an active signal, wherein the second determining unit is specifically configured to determine the reference SSNR of the audio signal and to determine the enhanced SSNR according to the reference SSNR of the audio signal.
[0039] With reference to the third aspect, in a first possible implementation of the third aspect, the second determining unit is specifically configured to determine the enhanced SSNR using the following formula: SSNR' = x * SSNR + y, where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and x and y indicate enhancement parameters.
[0041] According to a fourth aspect, the present invention provides an apparatus, wherein the apparatus includes: a first determining unit, configured to determine an input audio signal as an audio signal to be determined if it is determined that the audio signal is an unvoiced signal; a second determining unit, configured to determine a weight for the subband SNR of each subband in the audio signal, where the weight of the subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than the weight of the subband SNR of any other subband, and to determine an enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, where the enhanced SSNR is greater than a reference SSNR; and a third determining unit, configured to compare the enhanced SSNR with a VAD decision threshold to determine whether the audio signal is an active signal.
[0043] According to the method of the present invention, a characteristic of an audio signal can be determined, an enhanced SSNR is determined according to that characteristic, and the enhanced SSNR is compared with a VAD decision threshold, so that the rate of erroneous detection of active signals can be reduced.
[0045] Brief description of the drawings
[0047] To describe the technical solutions of the present invention more clearly, the accompanying drawings used to describe the embodiments are briefly introduced below. The accompanying drawings in the following description show merely some embodiments of the present invention.
[0049] FIG. 1 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention;
[0051] FIG. 2 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention;
[0053] FIG. 3 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention;
[0054] FIG. 4 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention;
[0056] FIG. 5 is a structural block diagram of an apparatus according to an embodiment of the present invention;
[0058] FIG. 6 is a structural block diagram of another apparatus according to an embodiment of the present invention;
[0059] FIG. 7 is a structural block diagram of an apparatus according to an embodiment of the present invention;
[0061] FIG. 8 is a structural block diagram of another apparatus according to an embodiment of the present invention;
[0062] FIG. 9 is a structural block diagram of another apparatus according to an embodiment of the present invention; and
[0063] FIG. 10 is a structural block diagram of another apparatus according to an embodiment of the present invention.
[0065] Description of embodiments
[0067] The following clearly and fully describes the technical solutions of the present invention with reference to the accompanying drawings showing preferred embodiments of the present invention. The described embodiments are merely some, but not all, of the embodiments of the present invention.
[0069] FIG. 1 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention.
[0071] 101. Determine an input audio signal as an audio signal to be determined.
[0073] 102. Determine an enhanced SSNR of the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0075] 103. Compare the enhanced SSNR to a VAD decision threshold to determine if the audio signal is an active signal.
[0077] In this embodiment of the present invention, when the enhanced SSNR is compared with the VAD decision threshold, a reference VAD decision threshold may be used, or a reduced VAD decision threshold, obtained by lowering the reference VAD decision threshold using a predetermined algorithm, may be used. The reference VAD decision threshold may be a default VAD decision threshold; it may be pre-stored or computed on the fly, using well-known existing technology. When the reference VAD decision threshold is lowered using the predetermined algorithm, the predetermined algorithm may multiply the reference VAD decision threshold by a coefficient less than 1, or another algorithm may be used. This embodiment of the present invention imposes no limitation on the specific algorithm used.
[0080] When a conventional SSNR calculation method is used to calculate the SSNRs of some audio signals, the SSNRs of these audio signals may be less than a predetermined VAD decision threshold even though these audio signals are in fact active audio signals. This is caused by the characteristics of these audio signals. For example, when the ambient SNR is relatively low, the subband SNRs of the high-frequency part are significantly lowered. Furthermore, since psychoacoustic theory is generally used to perform the subband division, the subband SNRs of the high-frequency part contribute relatively little to the SSNR. In this case, for some signals whose energy is mainly concentrated in a relatively high-frequency part, such as an unvoiced signal, an SSNR calculated using the conventional method may be lower than the VAD decision threshold, which causes an erroneous detection of an active signal. As another example, the energy distribution of some audio signals is relatively flat across the spectrum, but their overall energy is relatively low. Therefore, when the ambient SNR is relatively low, an SSNR calculated using the conventional method may also be lower than the VAD decision threshold. In the method shown in FIG. 1, the SSNR is suitably increased so that it can exceed the VAD decision threshold. Therefore, the rate of erroneous detection of active signals can be effectively reduced.
[0083] FIG. 2 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention.
[0085] 201. Determine a subband SNR of an input audio signal.
[0087] A spectrum of the input audio signal is divided into N subbands, where N is a positive integer greater than 1.
[0088] Specifically, psychoacoustic theory can be used to divide the spectrum of the audio signal. In that case, the lower the frequency of a subband, the narrower its bandwidth, and the higher the frequency of a subband, the wider its bandwidth. The spectrum of the audio signal can also be divided in another way, for example evenly into N subbands. A subband SNR is calculated for each subband of the input audio signal, where the subband SNR is the ratio of the energy in the subband to the background noise energy in the subband. The background noise energy in the subband is generally an estimate produced by a background noise estimator. How to use a background noise estimator to estimate the background noise energy of each subband is well known in this field and is therefore not described here. One skilled in the art can understand that the subband SNR may be a direct energy ratio, or another representation of a direct energy ratio, such as a logarithmic subband SNR. Furthermore, the subband SNR may also be obtained by applying linear or non-linear processing to a direct subband SNR, or may be another transformation of the subband SNR. The direct energy ratio form of the subband SNR is shown in the following formula:
[0090] $$\mathrm{snr}(k) = \frac{E(k)}{E_n(k)} \qquad \text{(Formula 1.2)}$$
[0091] where snr(k) denotes the subband SNR of the kth subband, and $E(k)$ and $E_n(k)$ denote the energy of the kth subband and the background noise energy in the kth subband, respectively. A logarithmic subband SNR can be denoted as $\mathrm{snr}_{\log}(k) = 10 \times \log_{10} \mathrm{snr}(k)$, where $\mathrm{snr}_{\log}(k)$ denotes the logarithmic subband SNR of the kth subband, and snr(k) denotes the subband SNR of the kth subband obtained using formula 1.2. One skilled in the art may further understand that the subband energy used to calculate a subband SNR may be the energy of the input audio signal in the subband, or may be the energy obtained after the background noise energy in the subband is subtracted from the energy of the input audio signal in the subband. Either calculation is appropriate provided that it does not deviate from the meaning of an SNR.
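The direct and logarithmic forms of the subband SNR, and the optional noise-subtracted variant, can be sketched as follows. The function names, the clamping at zero and the logarithm floor are assumptions introduced for this illustration and do not come from the patent.

```python
import math


def direct_snr(energy: float, noise_energy: float, subtract_noise: bool = False) -> float:
    """Direct energy ratio for one subband.

    With subtract_noise=True the noise estimate is first subtracted from the
    subband energy. Clamping at zero is an assumption made here for the case
    where the noise estimate exceeds the measured energy.
    """
    signal = energy - noise_energy if subtract_noise else energy
    return max(signal, 0.0) / noise_energy


def log_snr(snr: float, floor: float = 1e-10) -> float:
    """Logarithmic subband SNR, 10 * log10(snr); the floor avoids log10(0)."""
    return 10.0 * math.log10(max(snr, floor))


print(direct_snr(200.0, 2.0))                       # plain energy ratio
print(direct_snr(200.0, 2.0, subtract_noise=True))  # noise-subtracted variant
print(log_snr(direct_snr(200.0, 2.0)))              # same ratio expressed in dB
```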
[0093] 202. Determine the input audio signal as an audio signal to be determined.
[0095] Optionally, in one embodiment, determining the input audio signal as an audio signal to be determined may include: determining the audio signal as an audio signal to be determined according to the subband SNRs of the audio signal determined in step 201.
[0097] Optionally, in one embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNRs of the audio signal, determining the input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where the number of high-frequency portion subbands in the audio signal whose subband SNRs are greater than a first predetermined threshold is greater than a first quantity.
[0099] Optionally, in another embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNRs of the audio signal, determining the input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where the number of high-frequency portion subbands in the audio signal whose subband SNRs are greater than a first predetermined threshold is greater than a second quantity, and the number of low-frequency portion subbands in the audio signal whose subband SNRs are less than a second predetermined threshold is greater than a third quantity. In this embodiment of the present invention, the high-frequency portion and the low-frequency portion of an audio signal frame are relative, that is, a portion having a relatively high frequency is the high-frequency portion, and a portion having a relatively low frequency is the low-frequency portion.
[0101] Optionally, in another embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNRs of the audio signal, determining the input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where the number of subbands in the audio signal whose subband SNRs are greater than a third predetermined threshold is greater than a fourth quantity.
[0102] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected over a large number of unvoiced samples that include background noise, and the first predetermined threshold is determined from these subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these unvoiced samples are greater than the first predetermined threshold. Similarly, statistics on the subband SNRs of the low-frequency portion subbands are collected over these unvoiced samples, and the second predetermined threshold is determined from these subband SNRs, such that the subband SNRs of most of the low-frequency portion subbands in these unvoiced samples are less than the second predetermined threshold.
[0103] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0105] The first quantity, the second quantity, the third quantity and the fourth quantity are also obtained by collecting statistics. Taking the first quantity as an example: in a large number of unvoiced sample frames that include noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first quantity is determined from these statistics, such that in most of these unvoiced sample frames the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first quantity. The second quantity is obtained in a similar way; it may be the same as or different from the first quantity. Similarly, for the third quantity, in the large number of unvoiced sample frames that include noise, statistics are collected on the number of low-frequency portion subbands whose subband SNRs are less than the second predetermined threshold, and the third quantity is determined from these statistics, such that in most of these unvoiced sample frames the number of low-frequency portion subbands whose subband SNRs are less than the second predetermined threshold is greater than the third quantity. For the fourth quantity, in a large number of noise signal frames, statistics are collected on the number of subbands whose subband SNRs are less than the third predetermined threshold, and the fourth quantity is determined from these statistics, such that in most of these noise sample frames the number of subbands whose subband SNRs are less than the third predetermined threshold is greater than the fourth quantity.
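The first two counting rules for selecting an audio signal to be determined can be sketched as follows. This is a simplified illustration: it assumes that the high-frequency portion is simply every subband from a hypothetical index `high_start` upward, and all threshold and quantity values in the example are invented rather than derived from sample statistics.

```python
from typing import Sequence


def count_above(snrs: Sequence[float], threshold: float) -> int:
    """Number of subbands whose SNR is greater than the threshold."""
    return sum(1 for s in snrs if s > threshold)


def count_below(snrs: Sequence[float], threshold: float) -> int:
    """Number of subbands whose SNR is less than the threshold."""
    return sum(1 for s in snrs if s < threshold)


def high_band_rule(snrs: Sequence[float], high_start: int, t1: float, q1: int) -> bool:
    """First variant: more than q1 high-frequency subbands exceed threshold t1."""
    return count_above(snrs[high_start:], t1) > q1


def high_and_low_band_rule(snrs: Sequence[float], high_start: int,
                           t1: float, q2: int, t2: float, q3: int) -> bool:
    """Second variant: the high-band condition holds with quantity q2, and more
    than q3 low-frequency subbands fall below threshold t2."""
    return (count_above(snrs[high_start:], t1) > q2
            and count_below(snrs[:high_start], t2) > q3)


# Hypothetical frame of 20 subband SNRs with energy concentrated in subbands 18 and 19.
frame = [1.0] * 18 + [5.0, 6.0]
print(high_band_rule(frame, high_start=18, t1=3.0, q1=1))
print(high_and_low_band_rule(frame, high_start=18, t1=3.0, q2=1, t2=2.0, q3=10))
```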
[0107] Optionally, in another embodiment, whether the input audio signal is an audio signal to be determined may be decided by determining whether the input audio signal is an unvoiced signal. In this case, the subband SNRs of the audio signal are not needed for this decision; in other words, step 201 need not be performed when determining whether the audio signal is an audio signal to be determined. Specifically, determining the input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where it is determined that the input audio signal is an unvoiced signal. One skilled in the art can understand that there are multiple methods for detecting whether the audio signal is an unvoiced signal. For example, this can be determined by measuring the time-domain zero-crossing rate (ZCR) of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be an unvoiced signal, where the ZCR threshold is determined through a large number of experiments.
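The ZCR check described above can be sketched as follows. The definition used here (fraction of adjacent sample pairs whose signs differ) is one common convention and is an assumption, as are the function names and the threshold value; the patent only states that the threshold is found experimentally.

```python
from typing import Sequence


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent time-domain sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def zcr_exceeds_threshold(samples: Sequence[float], zcr_threshold: float) -> bool:
    """True when the frame's ZCR is greater than the (experimentally chosen) threshold."""
    return zero_crossing_rate(samples) > zcr_threshold


# Hypothetical frames: one alternating in sign, one slowly varying.
noisy_frame = [1.0, -1.0, 1.0, -1.0, 1.0]
smooth_frame = [1.0, 2.0, 3.0, 4.0]
print(zero_crossing_rate(noisy_frame), zcr_exceeds_threshold(noisy_frame, 0.3))
print(zero_crossing_rate(smooth_frame), zcr_exceeds_threshold(smooth_frame, 0.3))
```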
[0109] 203. Determine an enhanced SSNR of the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0111] The reference SSNR can be an SSNR obtained using formula 1.1. Formula 1.1 shows that no weighting is applied to the subband SNR of any subband when the reference SSNR is calculated; that is, the subband SNRs of all subbands have equal weights.
[0113] Optionally, in one embodiment, in a case where the number of high-frequency portion subbands in the audio signal whose subband SNRs are greater than the first predetermined threshold is greater than the first quantity, or in a case where the number of high-frequency portion subbands in the audio signal whose subband SNRs are greater than the first predetermined threshold is greater than the second quantity and the number of low-frequency portion subbands in the audio signal whose subband SNRs are less than the second predetermined threshold is greater than the third quantity, determining an enhanced SSNR of the audio signal includes: determining a weight for the subband SNR of each subband in the audio signal, where the weight of the subband SNR of a high-frequency portion subband whose subband SNR is greater than the first predetermined threshold is greater than the weight of the subband SNR of any other subband; and determining the enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal.
[0115] For example, if the audio signal is divided into 20 subbands, that is, subband 0 to subband 19, according to psychoacoustic theory, and the signal-to-noise ratios of subband 18 and subband 19 are both greater than a first predetermined value T1, four subbands can be added, that is, subband 20 to subband 23. Specifically, subband 18 and subband 19, whose signal-to-noise ratios are greater than T1, can be divided into subband 18a, subband 18b and subband 18c, and subband 19a, subband 19b and subband 19c, respectively. In this case, subband 18 can be considered the parent subband of subband 18a, subband 18b and subband 18c, and subband 19 can be considered the parent subband of subband 19a, subband 19b and subband 19c. The signal-to-noise ratios of subband 18a, subband 18b and subband 18c are the same as that of their parent subband, and the signal-to-noise ratios of subband 19a, subband 19b and subband 19c are the same as that of their parent subband. In this way, the 20 subbands originally obtained through division are further divided into 24 subbands. Since the VAD is still designed around the 20 subbands during active signal detection, the 24 subbands need to be mapped back to the 20 subbands to determine the enhanced SSNR. In conclusion, when the enhanced SSNR is determined by increasing the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, the calculation can be performed using the following formula:
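[0117] $$\mathrm{SSNR}' = \sum_{k=0}^{17} \mathrm{snr}(k) + 3 \times \mathrm{snr}(18) + 3 \times \mathrm{snr}(19) \qquad \text{(Formula 1.3)}$$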
[0119] where SSNR' indicates the enhanced SSNR, and snr(k) indicates a subband SNR of the kth subband.
[0121] If an SSNR obtained using formula 1.1 is the reference SSNR, the reference SSNR is $\sum_{k=0}^{19} \mathrm{snr}(k)$. Evidently, for an audio signal of a first type, the enhanced SSNR obtained using formula 1.3 is greater than the reference SSNR obtained using formula 1.1.
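The subband-splitting example can be checked numerically. The sketch below assumes that each selected high-frequency subband is split into three children that inherit the parent's SNR, so that summing the resulting 24 values is equivalent to giving subbands 18 and 19 a weight of 3. The function names and sample values are hypothetical.

```python
from typing import List, Sequence


def split_high_snr_subbands(snrs: Sequence[float], high_start: int,
                            t1: float, parts: int = 3) -> List[float]:
    """Replace each high-frequency subband whose SNR exceeds t1 with `parts`
    child subbands that carry the same SNR as their parent."""
    expanded: List[float] = []
    for k, s in enumerate(snrs):
        copies = parts if (k >= high_start and s > t1) else 1
        expanded.extend([s] * copies)
    return expanded


def enhanced_ssnr_by_splitting(snrs: Sequence[float], high_start: int,
                               t1: float, parts: int = 3) -> float:
    """Sum the SNRs over the expanded set of subbands."""
    return sum(split_high_snr_subbands(snrs, high_start, t1, parts))


# Hypothetical frame: subbands 0..17 have SNR 1.0, subbands 18 and 19 exceed T1 = 3.0.
snrs = [1.0] * 18 + [4.0, 5.0]
expanded = split_high_snr_subbands(snrs, high_start=18, t1=3.0)
reference = sum(snrs)
enhanced = enhanced_ssnr_by_splitting(snrs, high_start=18, t1=3.0)
print(len(expanded), reference, enhanced)
```

With these values the expanded list has 24 entries, and the enhanced sum equals the sum over subbands 0 to 17 plus three times each of snr(18) and snr(19), which is greater than the reference sum.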
[0123] For another example, if the audio signal is divided into 20 subbands, that is, subband 0 to subband 19, according to psychoacoustic theory, snr(18) and snr(19) are both greater than a first predetermined value T1, and snr(0) to snr(17) are all less than a second predetermined threshold T2, the enhanced SSNR can be determined using the following formula:
[0125] $$\mathrm{SSNR}' = a_1 \times \mathrm{snr}(18) + a_2 \times \mathrm{snr}(19) + \sum_{k=0}^{17} \mathrm{snr}(k) \qquad \text{(Formula 1.4)}$$
[0127] where SSNR' indicates the enhanced SSNR, snr(k) indicates the subband SNR of the kth subband, and a1 and a2 are weight gain parameters whose values make a1 × snr(18) + a2 × snr(19) greater than snr(18) + snr(19). Evidently, the enhanced SSNR obtained using formula 1.4 is greater than the reference SSNR obtained using formula 1.1.
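The weighted form of formula 1.4 can be sketched as a sum in which selected subbands carry a gain and all others carry a weight of 1. The dictionary-based interface and the gain values below are assumptions chosen for illustration; the patent only requires that the gains make the weighted high-band terms larger than the unweighted ones.

```python
from typing import Dict, Sequence


def enhanced_ssnr_weighted(snrs: Sequence[float], gains: Dict[int, float]) -> float:
    """Weighted SSNR: subband k is multiplied by gains[k] if present, else by 1."""
    return sum(gains.get(k, 1.0) * s for k, s in enumerate(snrs))


# Hypothetical frame and gains a1 = 2.0 for subband 18, a2 = 1.5 for subband 19.
snrs = [1.0] * 18 + [4.0, 5.0]
reference = sum(snrs)
enhanced = enhanced_ssnr_weighted(snrs, {18: 2.0, 19: 1.5})
print(reference, enhanced)
```

Passing an empty dictionary reproduces the unweighted reference sum, which makes it easy to compare the two values for the same frame.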
[0129] Optionally, in another embodiment, determining an enhanced SSNR of the audio signal includes: determining a reference SSNR of the audio signal and determining the enhanced SSNR according to the reference SSNR of the audio signal.
[0131] Optionally, the enhanced SSNR can be determined using the following formula:
[0133] $$\mathrm{SSNR}' = x \times \mathrm{SSNR} + y \qquad \text{(Formula 1.5)}$$
[0134] where SSNR indicates the reference SSNR of the audio signal, SSNR' indicates the enhanced SSNR, and x and y indicate enhancement parameters. For example, the value of x may be 1.05, and the value of y may be 1. One skilled in the art can understand that x and y may take other suitable values that make the enhanced SSNR appropriately greater than the reference SSNR.
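As a minimal sketch, and assuming the enhancement has the affine form SSNR' = x × SSNR + y with the example values x = 1.05 and y = 1 given above, the computation looks like this. The function name and the sample reference SSNR are hypothetical.

```python
def enhance_ssnr_affine(reference_ssnr: float, x: float = 1.05, y: float = 1.0) -> float:
    """Affine enhancement of the reference SSNR: x * SSNR + y."""
    return x * reference_ssnr + y


reference = 20.0  # hypothetical reference SSNR
enhanced = enhance_ssnr_affine(reference)
print(reference, enhanced, enhanced > reference)
```

Note that for a negative reference SSNR (possible when logarithmic subband SNRs are summed) a factor x greater than 1 lowers the product, so the parameters would need to be chosen with the SSNR representation in mind.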
[0136] Optionally, the enhanced SSNR can be determined using the following formula:
[0138] $$\mathrm{SSNR}' = f(x) \times \mathrm{SSNR} + h(y) \qquad \text{(Formula 1.6)}$$
[0139] where SSNR indicates the reference SSNR of the audio signal, SSNR' indicates the enhanced SSNR, and f(x) and h(y) indicate enhancement functions. For example, f(x) and h(y) may be functions of a long-term signal-to-noise ratio (LSNR) of the audio signal, where the LSNR is an average SNR or a weighted SNR over a relatively long period of time. For example, when lsnr is greater than 20, f(lsnr) can be equal to 1.1 and h(lsnr) can be equal to 2; when lsnr is less than 20 and greater than 15, f(lsnr) can be equal to 1.05 and h(lsnr) can be equal to 1; and when lsnr is less than 15, f(lsnr) can be equal to 1 and h(lsnr) can be equal to 0. One skilled in the art can understand that f(x) and h(y) can take other forms that make the enhanced SSNR appropriately greater than the reference SSNR.
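The LSNR-dependent example can be sketched with two piecewise-constant functions. This assumes the enhancement has the form SSNR' = f(lsnr) × SSNR + h(lsnr). The text does not say what happens when lsnr equals exactly 20 or 15, so the handling of those boundary values below is an assumption, as are the function names.

```python
def f(lsnr: float) -> float:
    """Multiplicative enhancement factor as a function of the long-term SNR."""
    if lsnr > 20:
        return 1.1
    if lsnr > 15:  # boundary lsnr == 20 falls in this tier (assumption)
        return 1.05
    return 1.0     # boundary lsnr == 15 falls in this tier (assumption)


def h(lsnr: float) -> float:
    """Additive enhancement offset as a function of the long-term SNR."""
    if lsnr > 20:
        return 2.0
    if lsnr > 15:
        return 1.0
    return 0.0


def enhance_ssnr_lsnr(reference_ssnr: float, lsnr: float) -> float:
    """Enhanced SSNR computed as f(lsnr) * SSNR + h(lsnr)."""
    return f(lsnr) * reference_ssnr + h(lsnr)


for lsnr in (25.0, 17.0, 10.0):
    print(lsnr, enhance_ssnr_lsnr(10.0, lsnr))
```

With the example values, the lowest tier leaves the SSNR unchanged, so the enhancement only takes effect when the long-term SNR is above 15.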
[0141] 204. Compare the enhanced SSNR to a VAD decision threshold to determine if the audio signal is an active signal.
[0143] Specifically, if the enhanced SSNR is greater than the VAD decision threshold, the audio signal is determined to be an active signal; if the enhanced SSNR is not greater than the VAD decision threshold, the audio signal is determined to be an inactive signal.
[0144] Optionally, in another embodiment, before the enhanced SSNR is compared with a VAD decision threshold, the method may further include: using a predetermined algorithm to lower the VAD decision threshold to obtain a reduced VAD decision threshold. In this case, comparing the enhanced SSNR with a VAD decision threshold specifically includes: comparing the enhanced SSNR with the reduced VAD decision threshold to determine whether the audio signal is an active signal. The reference VAD decision threshold may be a default VAD decision threshold; it may be pre-stored or computed on the fly, using well-known existing technology. When the reference VAD decision threshold is lowered using the predetermined algorithm, the predetermined algorithm may multiply the reference VAD decision threshold by a coefficient less than 1, or another algorithm may be used. This embodiment of the present invention imposes no limitation on the specific algorithm used. The VAD decision threshold can be appropriately lowered using the predetermined algorithm such that the enhanced SSNR is greater than the lowered VAD decision threshold. Therefore, the rate of erroneous detection of active signals can be reduced.
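The final comparison, with the optional threshold reduction by a coefficient less than 1, can be sketched as follows. The function names, the coefficient 0.9 and the sample values are hypothetical; the sketch also assumes a positive reference threshold, since scaling a negative threshold by such a coefficient would raise it rather than lower it.

```python
from typing import Optional


def reduced_threshold(reference_threshold: float, coefficient: float) -> float:
    """Lower a positive reference VAD threshold by scaling it with 0 < coefficient < 1."""
    if not 0.0 < coefficient < 1.0:
        raise ValueError("coefficient must be strictly between 0 and 1")
    return reference_threshold * coefficient


def vad_decision(enhanced_ssnr: float, reference_threshold: float,
                 coefficient: Optional[float] = None) -> bool:
    """True (active) if the enhanced SSNR exceeds the threshold in use.

    The reference threshold is used unless a reduction coefficient is supplied.
    """
    threshold = (reference_threshold if coefficient is None
                 else reduced_threshold(reference_threshold, coefficient))
    return enhanced_ssnr > threshold


print(vad_decision(9.5, 10.0))                   # compared with the reference threshold
print(vad_decision(9.5, 10.0, coefficient=0.9))  # compared with the reduced threshold
```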
[0146] According to the method shown in FIG. 2, a characteristic of an audio signal is determined, an enhanced SSNR is determined according to that characteristic, and the enhanced SSNR is compared with a VAD decision threshold. In this way, the rate of erroneous detection of active signals can be reduced.
[0148] FIG. 3 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention.
[0150] 301. Determine an input audio signal as an audio signal to be determined.
[0152] 302. Determine a weight for the subband SNR of each subband in the audio signal, where the weight of the subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than the weight of the subband SNR of any other subband.
[0154] 303. Determine an enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0156] The reference SSNR can be an SSNR obtained through calculation using formula 1.1. From formula 1.1 it can be seen that no weighting is applied to the subband SNR of any subband when the reference SSNR is calculated, i.e. the weights of the subband SNRs of all subbands are equal when the reference SSNR is calculated.
[0158] For example, if the audio signal is divided into 20 subbands, that is, from subband 0 to subband 19, according to a psychoacoustic theory, and the signal-to-noise ratios of subband 18 and subband 19 are both greater than a first predetermined threshold T1, four subbands can be added, that is, subband 20 to subband 23. Specifically, subband 18 and subband 19, whose signal-to-noise ratios are greater than T1, can be respectively divided into subband 18a, subband 18b and subband 18c; and subband 19a, subband 19b and subband 19c. In this case, subband 18 can be considered as a parent subband of subband 18a, subband 18b and subband 18c, and subband 19 can be considered as a parent subband of subband 19a, subband 19b and subband 19c. The signal-to-noise ratio values of subband 18a, subband 18b and subband 18c are the same as the signal-to-noise ratio value of their parent subband, and the signal-to-noise ratio values of subband 19a, subband 19b and subband 19c are the same as the signal-to-noise ratio value of their parent subband. In this way, the 20 subbands that were originally obtained through division are further divided into 24 subbands. Since the VAD is still designed according to the 20 subbands during active signal detection, the 24 subbands need to be mapped back to the 20 subbands to determine the enhanced SSNR. In conclusion, when the enhanced SSNR is determined by increasing the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, the calculation can be performed using the following formula:
$$SSNR' = \sum_{k=0}^{17} snr(k) + 3 \times snr(18) + 3 \times snr(19)$$
Formula 1.3
[0162] where SSNR' indicates the enhanced SSNR, and snr(k) indicates a subband SNR of the kth subband.
[0164] If an SSNR obtained through the calculation using formula 1.1 is the reference SSNR, the reference SSNR obtained through the calculation is $\sum_{k=0}^{19} snr(k)$. Obviously, for an audio signal of a first type, an enhanced SSNR value obtained through calculation using formula 1.3 is larger than a reference SSNR value obtained through calculation using formula 1.1.
[0166] For another example, if the audio signal is divided into 20 subbands, i.e. subband 0 to subband 19, according to psychoacoustic theory, snr(18) and snr(19) are both greater than a first predetermined threshold T1, and snr(0) to snr(17) are all less than a second predetermined threshold T2, the enhanced SSNR can be determined using the following formula:
[0168] $$SSNR' = a_1 \times snr(18) + a_2 \times snr(19) + \sum_{k=0}^{17} snr(k)$$
[0169] Formula 1.4
[0170] where SSNR' indicates the enhanced SSNR, snr(k) indicates a subband SNR of the kth subband, $a_1$ and $a_2$ are weighting parameters, and the values of $a_1$ and $a_2$ make $a_1 \times snr(18) + a_2 \times snr(19)$ greater than $snr(18) + snr(19)$.
[0171] Obviously, an enhanced SSNR value obtained through calculation using formula 1.4 is higher than the reference SSNR value obtained through calculation using formula 1.1.
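The relationship between the reference SSNR and the enhanced SSNR can be illustrated with a minimal sketch. It assumes the reference SSNR is the unweighted sum of the subband SNRs and that a high-frequency subband whose SNR exceeds T1 receives a larger weight. The function names, the parameter `high_start`, the default weight of 3 (mirroring the three-way split of subbands 18 and 19 described above) and the sample values are illustrative assumptions, not values given in this description.

```python
from typing import Sequence


def reference_ssnr(snr: Sequence[float]) -> float:
    """Unweighted sum of the subband SNRs (all weights equal)."""
    return sum(snr)


def enhanced_ssnr(snr: Sequence[float], t1: float, high_start: int,
                  boost: float = 3.0) -> float:
    """Weighted sum in which high-frequency subbands above t1 weigh more.

    high_start (hypothetical) is the index of the first subband treated as
    part of the high-frequency portion; boost is an assumed weight.
    """
    total = 0.0
    for k, value in enumerate(snr):
        weight = boost if (k >= high_start and value > t1) else 1.0
        total += weight * value
    return total


# 20 subbands: subbands 0..17 carry a low SNR, subbands 18 and 19 exceed T1.
snr_values = [1.0] * 18 + [4.0, 5.0]
ref = reference_ssnr(snr_values)
enh = enhanced_ssnr(snr_values, t1=3.0, high_start=18)
```

With positive subband SNRs and a weight greater than 1, the enhanced value exceeds the reference value, which is the property the comparison in step 304 relies on.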
[0173] 304. Compare the enhanced SSNR to a VAD decision threshold to determine if the audio signal is an active signal.
[0175] Specifically, when the enhanced SSNR is compared to the VAD decision threshold, if the enhanced SSNR is greater than the VAD decision threshold, the audio signal is determined to be an active signal; or if the enhanced SSNR is not greater than the VAD decision threshold, the audio signal is determined to be an idle signal.
[0176] According to the method shown in fig. 3, a characteristic of an audio signal can be determined, an enhanced SSNR is determined in a corresponding manner according to the characteristic of the audio signal, and the enhanced SSNR is compared to a VAD decision threshold. Therefore, an erroneous detection rate of an active signal can be reduced.
[0178] Further, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0180] Optionally, in one embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNR of the audio signal, determining the audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a first amount.
[0182] Optionally, in another embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNR of the audio signal, determining the audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
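The two counting conditions above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the index parameters `high_start` and `low_end` that delimit the high-frequency portion and the low-frequency extremity, and all numeric values are hypothetical.

```python
from typing import Optional, Sequence


def is_signal_to_be_determined(snr: Sequence[float], high_start: int, t1: float,
                               high_amount: int,
                               low_end: Optional[int] = None,
                               t2: Optional[float] = None,
                               low_amount: Optional[int] = None) -> bool:
    """Count-based test on the subband SNRs of one frame.

    With only the high-frequency arguments, the first condition is applied.
    When low_end, t2 and low_amount are also given, the low-frequency
    condition must hold as well.
    """
    # Number of high-frequency subbands whose SNR is greater than t1.
    high_count = sum(1 for v in snr[high_start:] if v > t1)
    if high_count <= high_amount:
        return False
    if low_end is None or t2 is None or low_amount is None:
        return True
    # Number of low-frequency subbands whose SNR is less than t2.
    low_count = sum(1 for v in snr[:low_end] if v < t2)
    return low_count > low_amount


frame_snr = [0.5] * 16 + [1.0, 1.0, 4.0, 5.0]
high_only = is_signal_to_be_determined(frame_snr, high_start=16, t1=3.0,
                                       high_amount=1)
high_and_low = is_signal_to_be_determined(frame_snr, high_start=16, t1=3.0,
                                          high_amount=1, low_end=16, t2=1.0,
                                          low_amount=10)
```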
[0184] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics on the subband SNRs of the low-frequency extremity subbands are collected on these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the low-frequency extremity subbands in these unvoiced samples are less than the second predetermined threshold.
[0186] The first amount, the second amount and the third amount are also obtained by collecting statistics. The first amount is used as an example: in a large number of non-speech sample frames, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined according to that number, such that, in most of these non-speech sample frames, the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of acquiring the second amount is similar to the method of acquiring the first amount. The second amount may be the same as the first amount, or the second amount may be different from the first amount. Similarly, for the third amount, in the large number of non-speech sample frames, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined according to that number, such that, in most of these non-speech sample frames, the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount.
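One possible way to turn such statistics into a threshold or an amount is sketched below. The description does not define "most", so the fraction of 0.9, the helper name `bound_exceeded_by_most`, the `margin` parameter and the sample data are all assumptions made for illustration.

```python
import math
from typing import Sequence


def bound_exceeded_by_most(values: Sequence[float], fraction: float = 0.9,
                           margin: float = 1e-6) -> float:
    """Return a bound that at least `fraction` of `values` strictly exceed.

    The bound is placed `margin` below the smallest of the values that must
    stay above it.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Number of values that must remain strictly above the bound.
    keep = max(1, math.ceil(fraction * len(ordered)))
    return ordered[len(ordered) - keep] - margin


# Hypothetical high-frequency subband SNRs gathered from non-speech samples.
collected_snrs = [float(v) for v in range(1, 11)]
t1 = bound_exceeded_by_most(collected_snrs)

# Hypothetical per-frame counts of high-frequency subbands with SNR > t1.
frame_counts = [3, 4, 2, 4, 3, 4, 4, 3, 4, 4]
first_amount = bound_exceeded_by_most(frame_counts, margin=1)
```

A threshold that most values must stay below, such as the second predetermined threshold, could be obtained symmetrically by applying the same helper to the negated values and negating the result.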
[0188] In the embodiments of fig. 1 to fig. 3, whether an input audio signal is an active signal is determined by using an enhanced SSNR. In the method shown in fig. 4, whether an input audio signal is an active signal is determined by lowering a VAD decision threshold.
[0190] Fig. 4 is a schematic flow diagram of a method for detecting an audio signal according to an embodiment of the present invention.
[0192] 401. Determining an input audio signal as an audio signal to be determined.
[0194] Optionally, in one embodiment, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined according to the subband SNR of the audio signal that is determined in step 201.
[0196] Optionally, in one embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNR of the audio signal, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a first amount.
[0198] Optionally, in another embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNR of the audio signal, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0200] Optionally, in another embodiment, in a case where the audio signal is determined as an audio signal to be determined according to the subband SNR of the audio signal, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where a number of subbands that are in the audio signal and whose subband SNRs are greater than a third predetermined threshold is greater than a fourth amount.
[0202] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics on the subband SNRs of the low-frequency extremity subbands are collected on these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the low-frequency extremity subbands in these unvoiced samples are less than the second predetermined threshold.
[0204] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0206] The first amount, the second amount, the third amount and the fourth amount are also obtained by collecting statistics. The first amount is used as an example: in a large number of non-speech sample frames, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined according to that number, such that, in most of these non-speech sample frames, the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of acquiring the second amount is similar to the method of acquiring the first amount. The second amount may be the same as the first amount, or the second amount may be different from the first amount. Similarly, for the third amount, in the large number of non-speech sample frames, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined according to that number, such that, in most of these non-speech sample frames, the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount. For the fourth amount, in a large number of noise signal frames, statistics are collected on the number of subbands whose subband SNRs are less than the third predetermined threshold, and the fourth amount is determined according to that number, such that, in most of these noise signal frames, the number of subbands whose subband SNRs are less than the third predetermined threshold is greater than the fourth amount.
[0208] Optionally, in another embodiment, whether the input audio signal is an audio signal to be determined may be determined by determining whether the input audio signal is a non-speech signal. In this case, it is not necessary to determine the subband SNR of the audio signal when determining whether the audio signal is an audio signal to be determined. In other words, operation 201 need not be performed when determining whether the audio signal is an audio signal to be determined. Specifically, determining an input audio signal as an audio signal to be determined includes: determining the audio signal as an audio signal to be determined in a case where it is determined that the input audio signal is a non-speech signal. One skilled in the art can understand that there may be multiple methods of detecting whether the audio signal is a non-speech signal. For example, whether the audio signal is a non-speech signal can be determined by detecting a time-domain zero-crossing rate (ZCR) of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be a non-speech signal, where the ZCR threshold is determined according to a large number of experiments.
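The ZCR-based check can be sketched as follows. The sketch assumes the ZCR is computed as the fraction of adjacent sample pairs whose signs differ. The function names, the sample frames and the threshold value of 0.3 are hypothetical, since the description only states that the ZCR threshold is obtained experimentally.

```python
from typing import Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_non_speech(frame: Sequence[float], zcr_threshold: float) -> bool:
    """Apply the rule stated above: a ZCR above the threshold means non-speech."""
    return zero_crossing_rate(frame) > zcr_threshold


noisy_frame = [1.0, -1.0, 1.0, -1.0, 1.0]   # sign changes at every sample
smooth_frame = [1.0, 2.0, 3.0, 2.0, 1.0]    # no sign change
noisy_is_non_speech = is_non_speech(noisy_frame, zcr_threshold=0.3)
smooth_is_non_speech = is_non_speech(smooth_frame, zcr_threshold=0.3)
```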
[0210] 402. Acquire a reference SSNR of the audio signal.
[0212] Specifically, the reference SSNR may be an SSNR obtained through calculation using formula 1.1.
[0214] 403. Using a predetermined algorithm to reduce a reference VAD decision threshold, in order to obtain a reduced VAD decision threshold.
[0216] Specifically, the reference VAD decision threshold may be a default VAD decision threshold, and the reference VAD decision threshold may be pre-stored or may be temporarily obtained through computation, where the reference VAD decision threshold may be calculated using well-known existing technology. When the reference VAD decision threshold is lowered using the predetermined algorithm, the predetermined algorithm may be multiplying the reference VAD decision threshold by a coefficient that is less than 1, or another algorithm may be used. This embodiment of the present invention imposes no limitation on the specific algorithm used. The VAD decision threshold can be appropriately lowered using the predetermined algorithm such that an enhanced SSNR is greater than the lowered VAD decision threshold. Therefore, an erroneous detection rate of an active signal can be reduced.
[0218] 404. Compare the reference SSNR with the reduced VAD decision threshold to determine if the audio signal is an active signal.
[0220] When a conventional SSNR calculation method is used to calculate the SSNRs of some audio signals, the SSNRs of these audio signals may be lower than a predetermined VAD decision threshold. However, in reality, these audio signals are active audio signals. This is caused by the characteristics of these audio signals. For example, in a case where an ambient SNR is relatively low, a subband SNR of a high-frequency part is significantly lowered. Furthermore, since a psychoacoustic theory is generally used to perform a subband division, the subband SNR of the high-frequency part has a relatively low contribution to an SSNR. In this case, for some signals, such as a non-speech signal, whose energy is mainly centered in a relatively high-frequency part, an SSNR obtained through calculation using the conventional SSNR calculation method may be lower than the decision threshold of VAD, which causes an erroneous detection of an active signal. For another example, for some audio signals, the energy distribution of these audio signals is relatively flat in a spectrum, but the overall energy of these audio signals is relatively low. Therefore, in the case where an ambient SNR is relatively low, an SSNR obtained through calculation using the conventional SSNR calculation method may be lower than the VAD decision threshold. In the method shown in fig. 4, a way of lowering a VAD decision threshold is used, such that an SSNR obtained through calculation using the conventional SSNR calculation method is larger than the VAD decision threshold. Therefore, an erroneous detection rate of an active signal can be effectively reduced.
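Steps 402 to 404 can be illustrated with a minimal sketch. It assumes that the reference SSNR is the unweighted sum of the subband SNRs, that the predetermined algorithm is multiplication by a coefficient less than 1, and that the reference threshold is positive. The function name, the coefficient 0.8 and the sample values are hypothetical, and the input is assumed to have already been determined as an audio signal to be determined in step 401.

```python
from typing import Sequence


def detect_with_lowered_threshold(snr: Sequence[float],
                                  reference_threshold: float,
                                  coefficient: float = 0.8) -> bool:
    """Return True if the frame is judged to be an active signal."""
    if not 0.0 < coefficient < 1.0:
        raise ValueError("coefficient must lie strictly between 0 and 1")
    reference_ssnr = sum(snr)                              # step 402
    reduced_threshold = reference_threshold * coefficient  # step 403
    return reference_ssnr > reduced_threshold              # step 404


subband_snrs = [1.0] * 20
# Comparison against the unmodified threshold, for contrast.
conventional_decision = sum(subband_snrs) > 22.0
lowered_decision = detect_with_lowered_threshold(subband_snrs,
                                                 reference_threshold=22.0)
```

In this example the frame falls just short of the unmodified threshold but passes once the threshold is lowered, which is the situation the method of fig. 4 is meant to address.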
[0222] Fig. 5 is a structural block diagram of an apparatus according to an embodiment of the present invention. The apparatus shown in fig. 5 can perform all the operations shown in fig. 1 or in fig. 2. As shown in fig. 5, an apparatus 500 includes a first determination unit 501, a second determination unit 502, and a third determination unit 503.
[0224] The first determination unit 501 is configured to determine an input audio signal as an audio signal to be determined.
[0226] The second determination unit 502 is configured to determine an enhanced segmental signal-to-noise ratio (SSNR) of the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0228] The third determination unit 503 is configured to compare the enhanced SSNR with a voice activity detection (VAD) decision threshold to determine whether the audio signal is an active signal.
[0230] The apparatus 500 shown in fig. 5 can determine a characteristic of an input audio signal, determine an enhanced SSNR in a corresponding manner according to the characteristic of the audio signal, and compare the enhanced SSNR with a VAD decision threshold, so that an erroneous detection rate of an active signal can be reduced.
[0232] Optionally, in one embodiment, the first determination unit 501 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0234] Optionally, in one embodiment, in a case where the first determining unit 501 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 501 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a first amount.
[0236] Optionally, in another embodiment, in a case where the first determining unit 501 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 501 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a second amount and a number of low frequency extremity subbands found in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0238] Optionally, in another embodiment, in a case where the first determining unit 501 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 501 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of subbands that are in the audio signal and whose subband SNRs are greater than a third predetermined threshold is greater than a fourth amount.
[0240] Optionally, in another embodiment, the first determination unit 501 is specifically configured to determine the audio signal as an audio signal to be determined in a case where the audio signal is determined to be a non-speech signal. Specifically, one skilled in the art can understand that there may be multiple methods of detecting whether the audio signal is a non-speech signal. For example, whether the audio signal is a non-speech signal can be determined by detecting a time domain ZCR of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be a non-speech signal, where the ZCR threshold is determined according to a large number of experiments.
[0242] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics on the subband SNRs of the low-frequency extremity subbands are collected on these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the low-frequency extremity subbands in these unvoiced samples are less than the second predetermined threshold.
[0244] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0246] The first amount, the second amount, the third amount and the fourth amount are also obtained by collecting statistics. The first amount is used as an example: on a large number of speech samples, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined according to that number, such that, in most of these speech samples, the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of determining the second amount is similar to the method of determining the first amount. The second amount may be the same as the first amount, or it may be different from the first amount. Similarly, for the third amount, on the large number of speech samples, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined according to that number, such that, in most of these speech samples, the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount. For the fourth amount, on the large number of speech samples, including noise, statistics are collected on the number of subbands whose subband SNRs are greater than the third predetermined threshold, and the fourth amount is determined according to that number, such that, in most of these speech samples, the number of subbands whose subband SNRs are greater than the third predetermined threshold is greater than the fourth amount.
[0248] Further, the second determination unit 502 is specifically configured to determine a weight of a subband SNR of each subband in the audio signal, where the weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than the first predetermined threshold is greater than the weight of a subband SNR of another subband, and to determine the enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal.
[0250] Optionally, in one embodiment, the second determination unit 502 is specifically configured to determine a reference SSNR of the audio signal, and determine the enhanced SSNR according to the reference SSNR of the audio signal.
[0252] The reference SSNR can be an SSNR obtained through calculation using formula 1.1. When the reference SSNR is calculated, the subband SNRs of all subbands that are included in the SSNR have the same weight.
[0254] Optionally, in another embodiment, the second determination unit 502 is specifically configured to determine the enhanced SSNR using the following formula:
$$SSNR' = x \times SSNR + y$$
[0256] Formula 1.7
[0257] where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and x and y indicate the enhancement parameters. For example, a value of x may be 1.05, and a value of y may be 1. One skilled in the art can understand that the values of x and y may be other suitable values that make the enhanced SSNR appropriately greater than the reference SSNR.
[0259] Optionally, in another embodiment, the second determination unit 502 is specifically configured to determine the enhanced SSNR using the following formula:
$$SSNR' = f(x) \times SSNR + h(y)$$
[0261] Formula 1.8
[0262] where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and f(x) and h(y) indicate enhancement functions. For example, f(x) and h(y) may be functions related to the LSNR of the audio signal, where the LSNR of the audio signal is an average SNR or a weighted SNR over a relatively long period of time. For example, when lsnr is greater than 20, f(lsnr) may be equal to 1.1, and h(lsnr) may be equal to 2; when lsnr is less than 20 and greater than 15, f(lsnr) may be equal to 1.05, and h(lsnr) may be equal to 1; and when lsnr is less than 15, f(lsnr) may be equal to 1, and h(lsnr) may be equal to 0. One skilled in the art can understand that f(x) and h(y) may be in other suitable forms that make the enhanced SSNR appropriately greater than the reference SSNR.
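The LSNR-dependent example values can be sketched as a small lookup. The sketch assumes the enhancement takes the form of a multiplier applied to the reference SSNR plus an offset, which is consistent with the example in which a multiplier of 1 and an offset of 0 leave the SSNR unchanged. The description does not assign the boundary values 20 and 15 to a tier, so placing them in the lower tier is an assumption, as are the function names.

```python
from typing import Tuple


def enhancement_terms(lsnr: float) -> Tuple[float, float]:
    """Return (multiplier, offset) for a given long-term SNR."""
    if lsnr > 20:
        return 1.1, 2.0
    if lsnr > 15:   # assumption: exactly 20 falls into this tier
        return 1.05, 1.0
    return 1.0, 0.0  # assumption: exactly 15 falls into this tier


def enhanced_from_reference(reference_ssnr: float, lsnr: float) -> float:
    """Assumed form: multiplier * reference SSNR + offset."""
    multiplier, offset = enhancement_terms(lsnr)
    return multiplier * reference_ssnr + offset


high_lsnr_result = enhanced_from_reference(10.0, lsnr=25.0)
mid_lsnr_result = enhanced_from_reference(10.0, lsnr=17.0)
low_lsnr_result = enhanced_from_reference(10.0, lsnr=10.0)
```

For a positive reference SSNR, the result is larger than the reference in the two upper tiers and equal to it in the lowest tier.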
[0264] The third determination unit 503 is specifically configured to compare the enhanced SSNR with the VAD decision threshold to determine, according to a result of the comparison, whether the audio signal is an active signal. Specifically, if the enhanced SSNR is greater than the VAD decision threshold, the audio signal is determined to be an active signal, or if the enhanced SSNR is less than the VAD decision threshold, the audio signal is determined to be an inactive signal.
[0266] Optionally, in another embodiment, a predetermined algorithm can also be used to reduce a reference VAD decision threshold to obtain a reduced VAD decision threshold, and the reduced VAD decision threshold is used to determine whether the audio signal is an active signal. In this case, the apparatus 500 may further include a fourth determination unit 504, where the fourth determination unit 504 is configured to use a predetermined algorithm to reduce the VAD decision threshold to obtain a reduced VAD decision threshold. In this case, the third determination unit 503 is specifically configured to compare the enhanced SSNR with the reduced VAD decision threshold to determine whether the audio signal is an active signal.
[0267] Fig. 6 is a structural block diagram of another apparatus according to an embodiment of the present invention. The apparatus shown in fig. 6 can perform all the operations shown in fig. 3. As shown in fig. 6, an apparatus 600 includes a first determination unit 601, a second determination unit 602, and a third determination unit 603.
[0270] The first determination unit 601 is configured to determine an input audio signal as an audio signal to be determined.
[0272] The second determining unit 602 is configured to determine a weight of a subband SNR of each subband in the audio signal, where a weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than a weight of a subband SNR of another subband and determine an enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0274] The third determination unit 603 is configured to compare the enhanced SSNR with a VAD decision threshold to determine whether the audio signal is an active signal.
[0276] The apparatus 600 shown in fig. 6 can determine a characteristic of an input audio signal, determine an enhanced SSNR in a corresponding manner according to the characteristic of the audio signal, and compare the enhanced SSNR with a VAD decision threshold, so that an erroneous detection rate of an active signal can be reduced.
[0278] Furthermore, the first determination unit 601 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0280] Optionally, in one embodiment, the first determination unit 601 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion sub-bands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a first amount.
[0282] Optionally, in another embodiment, the first determination unit 601 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion sub-bands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a second number, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0284] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics on the subband SNRs of the low-frequency extremity subbands are collected on these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the low-frequency extremity subbands in these unvoiced samples are less than the second predetermined threshold.
[0286] The first amount, the second amount and the third amount are also obtained by collecting statistics. Taking the first amount as an example, in a large number of non-speech sample frames, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined from these statistics, such that in most of these non-speech sample frames the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of acquiring the second amount is similar to the method of acquiring the first amount. The second amount may be the same as or different from the first amount. Similarly, for the third amount, in the large number of non-speech sample frames, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined from these statistics, such that in most of these non-speech sample frames the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount.
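The statistical selection of an amount described above can be illustrated with a minimal Python sketch. The function name `amount_from_statistics` and the parameter `coverage` (the fraction of frames meant by "most") are hypothetical and are not taken from the specification; the text does not state how "most" is quantified.

```python
import math


def amount_from_statistics(counts_per_frame, coverage=0.5):
    """Return the largest integer amount such that at least `coverage`
    of the sample frames have a subband count strictly greater than it.

    counts_per_frame: for each sample frame, the number of subbands whose
    subband SNR satisfied the threshold condition (hypothetical input).
    """
    if not counts_per_frame:
        raise ValueError("at least one sample frame is required")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(counts_per_frame)
    # Number of frames whose count must exceed the returned amount.
    needed = math.ceil(coverage * len(ordered))
    # The `needed` largest counts start at this index.
    k = len(ordered) - needed
    # Every count from index k onward is >= ordered[k] > ordered[k] - 1.
    return ordered[k] - 1


counts = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7]
print(amount_from_statistics(counts, coverage=0.5))
```

A higher `coverage` gives a smaller, more permissive amount; the appropriate value would have to come from the sample statistics themselves.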
[0288] FIG. 7 is a structural block diagram of an apparatus according to an embodiment of the present invention. The apparatus shown in FIG. 7 can perform all the operations shown in FIG. 1 or in FIG. 2. As shown in FIG. 7, an apparatus 700 includes a processor 701 and a memory 702. The processor 701 may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic component, a discrete gate or transistor logic component, or a discrete hardware component, which can implement or perform the methods, operations, and logical block diagrams described in embodiments of the present invention. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The operations of the methods described in embodiments of the present invention may be executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 702. The processor 701 reads an instruction from the memory 702 and completes the operations of the above methods in conjunction with the hardware.
[0290] Processor 701 is configured to determine an input audio signal as an audio signal to be determined.
[0292] Processor 701 is configured to determine an enhanced SSNR of the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0294] Processor 701 is configured to compare the enhanced SSNR to a VAD decision threshold to determine if the audio signal is an active signal.
[0296] The apparatus 700 shown in FIG. 7 can determine a characteristic of an input audio signal, determine an enhanced SSNR in a corresponding manner according to the characteristic of the audio signal, and compare the enhanced SSNR with a VAD decision threshold, so that an erroneous detection ratio of an active signal can be reduced.
[0298] Optionally, in one embodiment, the processor 701 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0300] Optionally, in one embodiment, in a case where the processor 701 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 701 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a first amount.
[0301] Optionally, in another embodiment, in a case where the processor 701 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 701 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0303] Optionally, in another embodiment, in a case where the processor 701 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 701 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of subbands that are in the audio signal and whose subband SNRs are greater than a third predetermined threshold is greater than a fourth amount.
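The three subband-SNR conditions described in the preceding embodiments can be sketched in Python as follows. This is only an illustration: the function names and the way the subbands are split into high-frequency and low-frequency lists are assumptions, and the thresholds and amounts would come from the statistics described in this specification.

```python
def count_above(snrs, threshold):
    """Number of subband SNRs strictly greater than the threshold."""
    return sum(1 for snr in snrs if snr > threshold)


def count_below(snrs, threshold):
    """Number of subband SNRs strictly less than the threshold."""
    return sum(1 for snr in snrs if snr < threshold)


def high_band_condition(high_snrs, first_threshold, first_amount):
    # Enough high-frequency subbands exceed the first threshold.
    return count_above(high_snrs, first_threshold) > first_amount


def high_and_low_band_condition(high_snrs, low_snrs, first_threshold,
                                second_threshold, second_amount,
                                third_amount):
    # Enough high-frequency subbands exceed the first threshold AND
    # enough low-frequency subbands fall below the second threshold.
    return (count_above(high_snrs, first_threshold) > second_amount
            and count_below(low_snrs, second_threshold) > third_amount)


def any_band_condition(all_snrs, third_threshold, fourth_amount):
    # Enough subbands, regardless of position, exceed the third threshold.
    return count_above(all_snrs, third_threshold) > fourth_amount


high = [5.0, 6.0, 1.0]   # hypothetical high-frequency subband SNRs
low = [0.5, 0.2, 3.0]    # hypothetical low-frequency subband SNRs
print(high_band_condition(high, 4.0, 1))
print(high_and_low_band_condition(high, low, 4.0, 1.0, 1, 1))
print(any_band_condition(high + low, 2.5, 2))
```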
[0305] Optionally, in another embodiment, the processor 701 is specifically configured to determine the audio signal as an audio signal to be determined in a case where the audio signal is determined to be a non-speech signal. Specifically, one skilled in the art can understand that there may be multiple methods of detecting whether the audio signal is a non-speech signal. For example, whether the audio signal is a non-speech signal can be determined by detecting a time domain ZCR of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be a non-speech signal, where the ZCR threshold is determined according to a large number of experiments.
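As an illustration of the ZCR-based check, the following Python sketch computes a time-domain zero-crossing rate for one frame and compares it with a threshold. The normalization (crossings divided by the number of adjacent sample pairs) and the treatment of a zero sample as non-negative are assumptions; the specification does not fix either choice, and the threshold value would be determined experimentally.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)  # a zero sample is treated as non-negative
    )
    return crossings / (len(samples) - 1)


def is_non_speech(samples, zcr_threshold):
    """Apply the rule from the text: ZCR above the threshold means non-speech."""
    return zero_crossing_rate(samples) > zcr_threshold


frame = [0.3, -0.2, 0.4, -0.1]  # hypothetical time-domain samples
print(zero_crossing_rate(frame), is_non_speech(frame, 0.5))
```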
[0307] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, so that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics are collected on the subband SNRs of the low-frequency extremity subbands in these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the low-frequency extremity subbands in these non-speech samples are less than the second predetermined threshold.
[0308] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0310] The first amount, the second amount, the third amount and the fourth amount are also obtained by collecting statistics. Taking the first amount as an example, in a large number of speech samples, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined from these statistics, such that in most of these speech samples the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of determining the second amount is similar to the method of determining the first amount. The second amount may be the same as or different from the first amount. Similarly, for the third amount, in the large number of speech samples, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined from these statistics, such that in most of these speech samples the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount. For the fourth amount, in the large number of speech samples, including noise, statistics are collected on the number of subbands whose subband SNRs are greater than the third predetermined threshold, and the fourth amount is determined from these statistics, such that in most of these speech samples the number of subbands whose subband SNRs are greater than the third predetermined threshold is greater than the fourth amount.
[0312] In addition, processor 701 is specifically configured to determine a weight of a subband SNR of each subband in the audio signal, where a weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than the first predetermined threshold is greater than a weight of a subband SNR of another subband, and to determine the enhanced SSNR according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal.
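The weighting described above can be sketched in Python under stated assumptions. The sketch assumes that the SSNR is a weighted sum of subband SNRs and that the reference SSNR corresponds to all weights being equal to 1; the weight values `boosted_weight` and `base_weight` are hypothetical and are not specified in this text.

```python
def weighted_ssnr(subband_snrs, high_band_indices, first_threshold,
                  boosted_weight=2.0, base_weight=1.0):
    """Weighted sum of subband SNRs with a larger weight on qualifying
    high-frequency subbands (illustrative values, not from the text)."""
    high = set(high_band_indices)
    total = 0.0
    for k, snr in enumerate(subband_snrs):
        # Only high-frequency subbands above the first threshold are boosted.
        boosted = k in high and snr > first_threshold
        total += (boosted_weight if boosted else base_weight) * snr
    return total


snrs = [1.0, 2.0, 5.0, 6.0]          # hypothetical subband SNRs
reference = sum(snrs)                 # equal weights of 1
enhanced = weighted_ssnr(snrs, high_band_indices=[2, 3], first_threshold=4.0)
print(reference, enhanced)
```

With these assumptions the enhanced value exceeds the equal-weight sum whenever at least one boosted subband has a positive SNR.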
[0314] Optionally, in one embodiment, processor 701 is specifically configured to determine a reference SSNR of the audio signal, and determine the enhanced SSNR according to the reference SSNR of the audio signal.
[0315] The reference SSNR can be an SSNR obtained through calculation using formula 1.1. When the reference SSNR is calculated, the subband SNRs of all subbands included in the SSNR have the same weight.
[0317] Optionally, in another embodiment, processor 701 is specifically configured to determine the enhanced SSNR using the following formula:
[0319] Formula 1.7
[0320] where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and x and y indicate the enhancement parameters. For example, a value of x may be 1.07, and a value of y may be 1. One skilled in the art can understand that the values of x and y may be other suitable values that appropriately make the enhanced SSNR greater than the reference SSNR.
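The body of Formula 1.7 is not reproduced in this text. The following Python sketch therefore assumes, purely for illustration, an affine form in which the reference SSNR is scaled by x and offset by y; this form is an assumption consistent with the example values above, not a statement of the actual formula.

```python
def enhance_ssnr(reference_ssnr, x=1.07, y=1.0):
    """Assumed affine enhancement: SSNR' = x * SSNR + y.

    The default x and y are the example values given in the text;
    the affine form itself is an assumption.
    """
    return x * reference_ssnr + y


print(enhance_ssnr(10.0))
```

Under this assumed form, with x greater than or equal to 1 and y greater than 0, the result is greater than the reference SSNR for any non-negative reference SSNR.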
[0322] Optionally, in another embodiment, processor 701 is specifically configured to determine the enhanced SSNR using the following formula:
[0324] Formula 1.8
[0325] where SSNR indicates the reference SSNR, SSNR' indicates the enhanced SSNR, and f(x) and h(y) indicate enhancement functions. For example, f(x) and h(y) may be functions related to an LSNR of the audio signal, where the LSNR of the audio signal is an average SNR or a weighted SNR within a relatively long period of time. For example, when lsnr is greater than 20, f(lsnr) may be equal to 1.1, and h(lsnr) may be equal to 2; when lsnr is less than 20 and greater than 17, f(lsnr) may be equal to 1.07, and h(lsnr) may be equal to 1; and when lsnr is less than 17, f(lsnr) may be equal to 1 and h(lsnr) may be equal to 0. One skilled in the art can understand that f(x) and h(y) may be in other suitable forms that appropriately make the enhanced SSNR greater than the reference SSNR.
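The LSNR-dependent example values can be sketched in Python as a piecewise lookup. Two points are assumptions: the body of Formula 1.8 is not reproduced in this text, so the sketch assumes the form SSNR' = f(lsnr) * SSNR + h(lsnr); and the text does not say what happens when lsnr is exactly 20 or exactly 17, so the sketch assigns those boundary values to the lower tier.

```python
def enhancement_parameters(lsnr):
    """Return (f, h) for a given long-term SNR, using the example values
    from the text. Boundary handling at 20 and 17 is an assumption."""
    if lsnr > 20:
        return 1.1, 2.0
    if lsnr > 17:
        return 1.07, 1.0
    return 1.0, 0.0


def enhance_ssnr_by_lsnr(reference_ssnr, lsnr):
    """Assumed form: SSNR' = f(lsnr) * SSNR + h(lsnr)."""
    f, h = enhancement_parameters(lsnr)
    return f * reference_ssnr + h


for lsnr in (25, 18, 10):
    print(lsnr, enhance_ssnr_by_lsnr(10.0, lsnr))
```

In the lowest tier the parameters are 1 and 0, so under the assumed form the SSNR is left unchanged there.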
[0327] Processor 701 is specifically configured to compare the enhanced SSNR with the VAD decision threshold to determine, based on a result of the comparison, whether the audio signal is an active signal. Specifically, if the enhanced SSNR is greater than the VAD decision threshold, the audio signal is determined to be an active signal, or if the enhanced SSNR is less than the VAD decision threshold, the audio signal is determined to be an inactive signal.
[0328] Optionally, in another embodiment, a predetermined algorithm can also be used to reduce a reference VAD decision threshold to obtain a reduced VAD decision threshold, and the reduced VAD decision threshold is used to determine whether the audio signal is an active signal. In this case, the processor 701 may be further configured to use a predetermined algorithm to reduce the VAD decision threshold, in order to obtain a reduced VAD decision threshold. In this case, processor 701 is specifically configured to compare the enhanced SSNR with the reduced VAD decision threshold to determine whether the audio signal is an active signal.
[0330] FIG. 8 is a structural block diagram of another apparatus according to an embodiment of the present invention. The apparatus shown in FIG. 8 can perform all the operations shown in FIG. 3.
[0331] As shown in FIG. 8, an apparatus 800 includes a processor 801 and a memory 802. The processor 801 may be a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic component, a discrete gate or transistor logic component, or a discrete hardware component, which can implement or perform the methods, operations, and logical block diagrams described in embodiments of the present invention. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The operations of the methods described in embodiments of the present invention may be executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as RAM, flash memory, ROM, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 802. The processor 801 reads an instruction from the memory 802 and completes the operations of the above methods in conjunction with the hardware.
[0333] Processor 801 is configured to determine an input audio signal as an audio signal to be determined.
[0335] Processor 801 is configured to determine a weight of a subband SNR of each subband in the audio signal, where a weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than a weight of a subband SNR of another subband, and to determine an enhanced SSNR based on the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, where the enhanced SSNR is greater than a reference SSNR.
[0336] Processor 801 is configured to compare the enhanced SSNR to a VAD decision threshold to determine if the audio signal is an active signal.
[0338] The apparatus 800 shown in FIG. 8 can determine a characteristic of an input audio signal, determine an enhanced SSNR correspondingly according to the characteristic of the audio signal, and compare the enhanced SSNR with a VAD decision threshold, so that an erroneous detection ratio of an active signal can be reduced.
[0340] Furthermore, the processor 801 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0342] Optionally, in one embodiment, the processor 801 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a first amount.
[0344] Optionally, in another embodiment, the processor 801 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than the first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0346] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics are collected on the subband SNRs of the low-frequency extremity subbands in these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the low-frequency extremity subbands in these non-speech samples are less than the second predetermined threshold.
[0347] The first amount, the second amount and the third amount are also obtained by collecting statistics. Taking the first amount as an example, in a large number of non-speech sample frames, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined from these statistics, such that in most of these non-speech sample frames the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of acquiring the second amount is similar to the method of acquiring the first amount. The second amount may be the same as or different from the first amount. Similarly, for the third amount, in the large number of non-speech sample frames, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined from these statistics, such that in most of these non-speech sample frames the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount.
[0349] FIG. 9 is a structural block diagram of another apparatus according to an embodiment of the present invention. An apparatus 900 shown in FIG. 9 can perform all the operations shown in FIG. 4. As shown in FIG. 9, the apparatus 900 includes a first determination unit 901, a second determination unit 902, a third determination unit 903, and a fourth determination unit 904.
[0351] The first determining unit 901 is configured to determine an input audio signal as an audio signal to be determined.
[0353] The second determining unit 902 is configured to acquire a reference SSNR of the audio signal.
[0354] Specifically, the reference SSNR may be an SSNR obtained through calculation using formula 1.1.
[0355] The third determining unit 903 is configured to use a predetermined algorithm to reduce a reference VAD decision threshold, in order to obtain a reduced VAD decision threshold.
[0357] Specifically, the reference VAD decision threshold may be a default VAD decision threshold, and the reference VAD decision threshold may be pre-stored or may be temporarily obtained through computation, where the reference VAD decision threshold can be calculated using well-known existing technology. When the reference VAD decision threshold is lowered using the predetermined algorithm, the predetermined algorithm may be multiplying the reference VAD decision threshold by a coefficient that is less than 1, or another algorithm may be used. This embodiment of the present invention imposes no limitation on a specific algorithm used. The VAD decision threshold can be appropriately lowered using the predetermined algorithm such that the enhanced SSNR is greater than the lowered VAD decision threshold. Therefore, an erroneous detection rate of an active signal can be reduced.
[0358] The fourth determining unit 904 is configured to compare the reference SSNR with the reduced VAD decision threshold to determine if the audio signal is an active signal.
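The threshold-reduction path performed by the third and fourth determination units can be sketched in Python as follows. The coefficient value 0.9 is hypothetical; the text only requires a coefficient less than 1 and allows other algorithms. The sketch also assumes a positive reference threshold, since multiplying a negative threshold by a coefficient less than 1 would raise it rather than lower it.

```python
def reduce_vad_threshold(reference_threshold, coefficient=0.9):
    """Lower a positive reference VAD decision threshold by scaling it."""
    if not 0.0 < coefficient < 1.0:
        raise ValueError("coefficient must be in (0, 1)")
    if reference_threshold <= 0.0:
        raise ValueError("this sketch assumes a positive threshold")
    return reference_threshold * coefficient


def is_active(ssnr, vad_threshold):
    """Active if the SSNR is greater than the decision threshold."""
    return ssnr > vad_threshold


reference_ssnr = 6.0          # hypothetical reference SSNR of one frame
reference_threshold = 10.0    # hypothetical reference VAD threshold
reduced = reduce_vad_threshold(reference_threshold, coefficient=0.5)
print(is_active(reference_ssnr, reference_threshold),
      is_active(reference_ssnr, reduced))
```

In this example the frame is judged inactive against the reference threshold but active against the reduced one, which is the effect the embodiment relies on.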
[0360] Optionally, in one embodiment, the first determination unit 901 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0362] Optionally, in one embodiment, in a case where the first determining unit 901 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 901 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a first amount.
[0364] Optionally, in one embodiment, in a case where the first determining unit 901 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 901 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands found in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0366] Optionally, in one embodiment, in a case where the first determining unit 901 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the first determining unit 901 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of subbands that are in the audio signal and whose subband SNRs are greater than a third predetermined threshold is greater than a fourth amount.
[0368] Optionally, in one embodiment, the first determination unit 901 is specifically configured to determine the audio signal as an audio signal to be determined in a case where the audio signal is determined to be a non-speech signal. Specifically, one skilled in the art can understand that there may be multiple methods of detecting whether the audio signal is a non-speech signal. For example, whether the audio signal is a non-speech signal can be determined by detecting a ZCR of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be a non-speech signal, where the ZCR threshold is determined according to a large number of experiments.
[0370] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of the high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, so that the subband SNRs of most of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics are collected on the subband SNRs of the low-frequency extremity subbands in these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of most of the low-frequency extremity subbands in these non-speech samples are less than the second predetermined threshold.
[0371] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0373] The first amount, the second amount, the third amount and the fourth amount are also obtained by collecting statistics. Taking the first amount as an example, in a large number of speech samples, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first amount is determined from these statistics, such that in most of these speech samples the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold is greater than the first amount. A method of determining the second amount is similar to the method of determining the first amount. The second amount may be the same as or different from the first amount. Similarly, for the third amount, in the large number of speech samples, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third amount is determined from these statistics, such that in most of these speech samples the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold is greater than the third amount. For the fourth amount, in the large number of speech samples, including noise, statistics are collected on the number of subbands whose subband SNRs are greater than the third predetermined threshold, and the fourth amount is determined from these statistics, such that in most of these speech samples the number of subbands whose subband SNRs are greater than the third predetermined threshold is greater than the fourth amount.
[0375] The apparatus 900 shown in FIG. 9 can determine a characteristic of an input audio signal, lower a reference VAD decision threshold based on the characteristic of the audio signal, and compare an enhanced SSNR with the lowered VAD decision threshold, such that an erroneous detection rate of an active signal can be reduced.
[0377] FIG. 10 is a structural block diagram of another apparatus according to an embodiment of the present invention. An apparatus 1000 shown in FIG. 10 can perform all the operations shown in FIG. 4. As shown in FIG. 10, the apparatus 1000 includes a processor 1001 and a memory 1002. The processor 1001 may be a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic component, a discrete gate or transistor logic component, or a discrete hardware component, which can implement or perform the methods, operations, and logical block diagrams described in embodiments of the present invention. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The operations of the methods described in embodiments of the present invention may be executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as RAM, flash memory, ROM, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 1002. The processor 1001 reads an instruction from the memory 1002 and completes the operations of the above methods in conjunction with the hardware.
[0379] Processor 1001 is configured to determine an input audio signal as an audio signal to be determined.
[0381] Processor 1001 is configured to acquire a reference SSNR of the audio signal.
[0383] Specifically, the reference SSNR may be an SSNR obtained through calculation using formula 1.1.
[0384] Processor 1001 is configured to use a predetermined algorithm to reduce a reference VAD decision threshold to obtain a reduced VAD decision threshold.
[0386] Specifically, the reference VAD decision threshold may be a default VAD decision threshold, and the reference VAD decision threshold may be pre-stored or may be temporarily obtained through computation, where the reference VAD decision threshold can be calculated using well-known existing technology. When the reference VAD decision threshold is lowered using the predetermined algorithm, the predetermined algorithm may be multiplying the reference VAD decision threshold by a coefficient that is less than 1, or another algorithm may be used. This embodiment of the present invention imposes no limitation on a specific algorithm used. The VAD decision threshold can be appropriately lowered using the predetermined algorithm such that an enhanced SSNR is greater than the lowered VAD decision threshold. Therefore, an erroneous detection rate of an active signal can be reduced.
[0387] Processor 1001 is configured to compare the reference SSNR to the reduced VAD decision threshold to determine if the audio signal is an active signal.
[0389] Optionally, in one embodiment, the processor 1001 is specifically configured to determine the audio signal as an audio signal to be determined according to a subband SNR of the audio signal.
[0391] Optionally, in one embodiment, in a case where the processor 1001 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 1001 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a first amount.
[0392] Optionally, in one embodiment, in a case where the processor 1001 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 1001 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of high-frequency portion subbands that are in the audio signal and whose subband SNRs are greater than a first predetermined threshold is greater than a second amount, and a number of low-frequency extremity subbands that are in the audio signal and whose subband SNRs are less than a second predetermined threshold is greater than a third amount.
[0394] Optionally, in one embodiment, in a case where the processor 1001 determines the audio signal as an audio signal to be determined according to the subband SNR of the audio signal, the processor 1001 is specifically configured to determine the audio signal as an audio signal to be determined in a case where a number of subbands that are in the audio signal and whose subband SNRs are greater than a third predetermined threshold is greater than a fourth amount.
[0396] Optionally, in one embodiment, the processor 1001 is specifically configured to determine the audio signal as an audio signal to be determined in a case where the audio signal is determined to be a non-speech signal. Specifically, one skilled in the art can understand that there may be multiple methods of detecting whether the audio signal is a non-speech signal. For example, whether the audio signal is a non-speech signal can be determined by detecting a ZCR of the audio signal. Specifically, in a case where the ZCR of the audio signal is greater than a ZCR threshold, the audio signal is determined to be a non-speech signal, where the ZCR threshold is determined according to a large number of experiments.
[0398] The first predetermined threshold and the second predetermined threshold can be obtained by collecting statistics based on a large number of speech samples. Specifically, statistics on the subband SNRs of high-frequency portion subbands are collected on a large number of non-speech samples, including background noise, and the first predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the high-frequency portion subbands in these non-speech samples are greater than the first predetermined threshold. Similarly, statistics are collected on the subband SNRs of the low-frequency extremity subbands in these non-speech samples, and the second predetermined threshold is determined based on the subband SNRs, such that the subband SNRs of the majority of the low-frequency extremity subbands in these non-speech samples are less than the second predetermined threshold.
[0399] The third predetermined threshold is also obtained by collecting statistics. Specifically, the third predetermined threshold is determined according to the subband SNRs of a large number of noise signals, such that the subband SNRs of most of the subbands in these noise signals are less than the third predetermined threshold.
[0401] The first quantity, the second quantity, the third quantity and the fourth quantity are also obtained by collecting statistics. Taking the first quantity as an example: on a large number of speech samples, including noise, statistics are collected on the number of high-frequency portion subbands whose subband SNRs are greater than the first predetermined threshold, and the first quantity is determined according to that number, such that the number of high-frequency portion subbands that are found in most of these speech samples and whose subband SNRs are greater than the first predetermined threshold is greater than the first quantity. The method of determining the second quantity is similar to the method of determining the first quantity. The second quantity may be the same as the first quantity, or it may be different from the first quantity. Similarly, for the third quantity, on the large number of speech samples, including noise, statistics are collected on the number of low-frequency extremity subbands whose subband SNRs are less than the second predetermined threshold, and the third quantity is determined according to that number, such that the number of low-frequency extremity subbands that are found in most of these speech samples and whose subband SNRs are less than the second predetermined threshold is greater than the third quantity. For the fourth quantity, on the large number of speech samples, including noise, statistics are collected on the number of subbands whose subband SNRs are greater than the third predetermined threshold, and the fourth quantity is determined according to that number, such that the number of subbands that are found in most of these speech samples and whose subband SNRs are greater than the third predetermined threshold is greater than the fourth quantity.
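One possible way to derive such thresholds from collected subband SNRs is an order-statistic rule, sketched below. The embodiments do not specify how the statistics are turned into a threshold, so the helper names, the `fraction` parameter that stands for "the majority", and the non-strict treatment of ties are all assumptions of this sketch.

```python
import math
from typing import Sequence


def lower_bound_threshold(values: Sequence[float], fraction: float) -> float:
    """Return t such that at least `fraction` of the values are >= t.

    Usable for the first threshold: most high-frequency subband SNRs lie above it.
    """
    ordered = sorted(values)
    need = math.ceil(fraction * len(ordered))  # values that must lie at or above t
    return ordered[len(ordered) - need]


def upper_bound_threshold(values: Sequence[float], fraction: float) -> float:
    """Return t such that at least `fraction` of the values are <= t.

    Usable for the second and third thresholds: most subband SNRs lie below them.
    """
    ordered = sorted(values)
    need = math.ceil(fraction * len(ordered))  # values that must lie at or below t
    return ordered[need - 1]
```

With eight pooled SNR values 1 to 8 and `fraction = 0.75`, the first helper returns 3 (six of the eight values are at or above it) and the second returns 6 (six of the eight values are at or below it). Both helpers expect a non-empty input and a fraction in the range (0, 1].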
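The statistical choice of a quantity can be sketched in the same spirit. The example below counts, per sample, the subbands whose SNR exceeds a threshold and then picks the largest quantity that is still exceeded by the required share of samples. The function name, the `fraction` parameter and the order-statistic rule are assumptions; the paragraph above only requires that the count exceed the quantity in most samples.

```python
import math
from typing import Sequence


def calibrate_quantity(
    per_sample_snrs: Sequence[Sequence[float]],
    threshold: float,
    fraction: float = 0.75,
) -> int:
    """Pick a quantity such that at least `fraction` of the samples have more
    subbands above `threshold` than that quantity."""
    counts = sorted(
        sum(1 for snr in snrs if snr > threshold) for snrs in per_sample_snrs
    )
    need = math.ceil(fraction * len(counts))  # samples that must exceed the quantity
    # Counts are integers, so subtracting 1 turns ">=" into a strict ">".
    return counts[len(counts) - need] - 1
```

For four samples with 3, 4, 5 and 2 qualifying subbands and `fraction = 0.75`, the function returns 2, and three of the four samples have more than two qualifying subbands. The result can be -1 when too many samples have no qualifying subband at all, which signals that the threshold itself is unsuitable for the data.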
[0403] The apparatus 1000 shown in FIG. 10 can determine a characteristic of an input audio signal, lower a reference VAD decision threshold according to the characteristic of the audio signal, and compare an improved SSNR with the lowered VAD decision threshold, such that the erroneous detection rate of an active signal can be reduced.
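The effect of lowering the decision threshold can be shown with a small sketch. How far the threshold is lowered is not stated in this paragraph, so the fixed offset `reduction` and the flag `is_candidate` (standing for the outcome of the characteristic check) are assumptions made purely for illustration.

```python
def vad_decision(
    ssnr: float,
    reference_threshold: float,
    is_candidate: bool,
    reduction: float,
) -> bool:
    """Return True when the frame is judged to be an active signal.

    The reference threshold is lowered by `reduction` only for frames that the
    characteristic check has flagged; other frames use the reference threshold.
    """
    threshold = reference_threshold - reduction if is_candidate else reference_threshold
    return ssnr > threshold
```

With an SSNR of 4.5, a reference threshold of 5.0 and a reduction of 1.0, a flagged frame is detected as active, whereas the same frame would be missed if the reference threshold were applied unchanged.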
[0405] A person skilled in the art can clearly understand that, for the purpose of a convenient and brief description, for a detailed working process of the above system, apparatus and unit, reference may be made to a corresponding process in the embodiments of the above method, and the details are not described here again.
[0407] In the various embodiments provided in the present application, it should be understood that the described system, apparatus, and method may be implemented in other ways. For example, the described embodiment of the apparatus is merely exemplary. For example, unit division is simply logical function division and may be another division in the actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the shown or described mutual couplings or direct couplings or communication connections can be implemented using some interfaces. Indirect couplings or communication connections between apparatuses or units can be implemented electronically, mechanically or in other ways.
[0409] Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, may be located in one location, or may be distributed over a plurality of network units. Some or all of the units may be selected according to the actual needs to achieve the objectives of the solutions of the embodiments.
[0411] Furthermore, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit.
[0413] When the functions are implemented in the form of a functional unit of software and are sold or used as a stand-alone product, the functions may be stored on a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions can be implemented in the form of a software product. The software product is stored on a storage medium and includes various instructions to instruct a computing device (which may be a personal computer, server, or network device) or processor to perform all or part of the operations of the methods described in embodiments of the present invention. The above storage medium includes: any medium that can store program code, such as a USB flash drive, removable hard drive, ROM, RAM, magnetic disk, or optical disk.
[0414] The above descriptions are merely specific embodiments of the present invention, but are not intended to limit the present invention.

Claims:
Claims (6)
[1]
1. A method of detecting an active signal, wherein the method comprises:
when an audio signal is determined to be a non-speech signal,
determining (102) an enhanced segmental signal-to-noise ratio, SSNR, of the audio signal, wherein the enhanced SSNR is greater than a reference SSNR and the reference SSNR is calculated by summing all subband SNRs of the audio signal; and
comparing (103) the enhanced SSNR with a voice activity detection, VAD, decision threshold to determine whether the audio signal is an active signal,
wherein determining (102) the enhanced SSNR of the audio signal comprises:
determining the reference SSNR of the audio signal; and
determining the enhanced SSNR according to the reference SSNR of the audio signal.
[2]
2. The method according to claim 1, wherein determining the enhanced SSNR according to the reference SSNR of the audio signal comprises:
determining the enhanced SSNR using the following formula:

[3]
3. A method of detecting an active signal, wherein the method comprises:
when an audio signal is determined to be a non-speech signal,
determining (302) a weight of a subband signal-to-noise ratio, SNR, of each subband in the audio signal, wherein a weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than a weight of a subband SNR of another subband;
determining (303) an enhanced segmental signal-to-noise ratio, SSNR, according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, wherein the enhanced SSNR is greater than a reference SSNR and the reference SSNR is calculated by summing all subband SNRs of the audio signal; and
comparing (304) the enhanced SSNR with a voice activity detection, VAD, decision threshold to determine whether the audio signal is an active signal.
[4]
4. An apparatus for detecting an active signal, the apparatus comprising:
a first determining unit (501), configured to determine whether an audio signal is a non-speech signal;
a second determining unit (502), configured to determine an enhanced segmental signal-to-noise ratio, SSNR, of the audio signal, wherein the enhanced SSNR is greater than a reference SSNR, and the reference SSNR is calculated by summing all subband SNRs of the audio signal; and
a third determining unit (503), configured to compare the enhanced SSNR with a voice activity detection, VAD, decision threshold to determine whether the audio signal is an active signal, in a case where the audio signal is a non-speech signal;
wherein the second determining unit (502) is specifically configured to determine the reference SSNR of the audio signal, and determine the enhanced SSNR according to the reference SSNR of the audio signal.
[5]
5. The apparatus according to claim 4, wherein the second determining unit (502) is specifically configured to determine the enhanced SSNR using the following formula:

[6]
6. An apparatus for detecting an active signal, the apparatus comprising:
a first determining unit (601), configured to determine whether an audio signal is a non-speech signal;
a second determining unit (602), configured to determine a weight of a subband signal-to-noise ratio, SNR, of each subband in the audio signal, wherein a weight of a subband SNR of a high-frequency portion subband whose subband SNR is greater than a first predetermined threshold is greater than a weight of a subband SNR of another subband, and to determine an enhanced segmental signal-to-noise ratio, SSNR, according to the subband SNR of each subband and the weight of the subband SNR of each subband in the audio signal, wherein the enhanced SSNR is greater than a reference SSNR, and the reference SSNR is calculated by summing all subband SNRs of the audio signal; and
a third determining unit (603), configured to compare the enhanced SSNR with a voice activity detection, VAD, decision threshold to determine whether the audio signal is an active signal, in a case where the audio signal is a non-speech signal.
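By way of illustration only, the weighting described in claims 3 and 6 can be sketched as follows. The reference SSNR is the plain sum of the subband SNRs, as the claims state; the concrete weights (`boost` for qualifying high-frequency subbands, `base` for all others) and the index `high_start` are assumptions of this sketch, since the claims only require that the former weight be greater than the latter.

```python
from typing import Sequence


def reference_ssnr(snrs: Sequence[float]) -> float:
    """Reference SSNR: the sum of all subband SNRs."""
    return sum(snrs)


def enhanced_ssnr(
    snrs: Sequence[float],
    high_start: int,
    thr1: float,
    boost: float = 2.0,
    base: float = 1.0,
) -> float:
    """Weighted sum in which high-frequency subbands above thr1 get the larger weight."""
    total = 0.0
    for index, snr in enumerate(snrs):
        weight = boost if (index >= high_start and snr > thr1) else base
        total += weight * snr
    return total


def is_active(snrs, high_start, thr1, vad_threshold) -> bool:
    """Compare the enhanced SSNR with the VAD decision threshold."""
    return enhanced_ssnr(snrs, high_start, thr1) > vad_threshold
```

With `snrs = [0.5, 0.8, 1.0, 4.0, 5.0, 6.0]`, `high_start = 3` and `thr1 = 3.0`, the reference SSNR is about 17.3 and the enhanced SSNR is about 32.3, so a frame that would fall below a decision threshold of 20 on the reference SSNR is detected as active on the enhanced SSNR. The enhanced value exceeds the reference value here because `base` is 1, `boost` is greater than 1, and the boosted SNRs are positive.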

Similar technologies:

Publication number | Publication date | Patent title

ES2787894T3|2020-10-19|Method and device for detecting the audio signal

ES2540075T3|2015-07-08|Transient voice coding method and device, decoding method and device, processing system and computer readable storage medium

ES2733099T3|2019-11-27|Systems, procedures and devices for signal change detection

ES2860986T3|2021-10-05|Method and apparatus for adaptively detecting a voice activity in an input audio signal

CN104067339B|2016-05-25|Noise-suppressing device

US9978398B2|2018-05-22|Voice activity detection method and device

US20170206916A1|2017-07-20|Voice Activity Detection Method and Apparatus

CN113724725B|2022-01-18|Bluetooth audio squeal detection suppression method, device, medium and Bluetooth device

CN104091593A|2014-10-08|Voice endpoint detection algorithm adopting perception spectrogram structure boundary parameter

Tian et al.2016|An Investigation of Spoofing Speech Detection Under Additive Noise and Reverberant Conditions.

JP6067930B2|2017-01-25|Automatic gain matching for multiple microphones

Ong et al.2016|Robust voice activity detection using gammatone filtering and entropy

Maganti et al.2012|A perceptual masking approach for noise robust speech recognition

CN111462757A|2020-07-28|Data processing method and device based on voice signal, terminal and storage medium

Song et al.2014|Voice Activity Detection Using Modified Power Spectral Deviation Based on Teager Energy

Liu et al.2013|Improved spectral subtraction speech enhancement algorithm

JP5179578B2|2013-04-10|Limiting distortion introduced by post-processing steps during decoding of digital signals

Pinto et al.2007|Speech modeling and noise removal using a perceptually modified Wiener filter

Patent family:

Publication number | Publication date

EP3118852A1|2017-01-18|

AU2014386442A1|2016-09-08|

MX2016011750A|2016-12-12|

JP2017511901A|2017-04-27|

AU2014386442B9|2017-11-23|

RU2666337C2|2018-09-06|

JP2019053321A|2019-04-04|

CA2940487C|2020-10-27|

KR20160120764A|2016-10-18|

JP6793706B2|2020-12-02|

CA2940487A1|2015-09-17|

MX355828B|2018-05-02|

US10818313B2|2020-10-27|

AU2014386442B2|2017-11-02|

KR102005009B1|2019-07-29|

EP3118852A4|2017-03-29|

CN107293287A|2017-10-24|

PT3118852T|2020-03-06|

US20160379670A1|2016-12-29|

EP3660845A1|2020-06-03|

ES2787894T3|2020-10-19|

US10304478B2|2019-05-28|

KR20180088503A|2018-08-03|

CN107086043A|2017-08-22|

CN104916292B|2017-05-24|

JP6493889B2|2019-04-03|

US20190279657A1|2019-09-12|

CN104916292A|2015-09-16|

RU2016139717A|2018-04-12|

SG11201607052SA|2016-10-28|

CN107086043B|2020-09-08|

US20200312353A1|2020-10-01|

WO2015135344A1|2015-09-17|

EP3118852B1|2020-02-12|

CN107293287B|2021-10-26|

KR101884220B1|2018-08-01|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Patent title

FI100840B|1995-12-12|1998-02-27|Nokia Mobile Phones Ltd|Noise attenuator and method for attenuating background noise from noisy speech and a mobile station|

US5991718A|1998-02-27|1999-11-23|At&T Corp.|System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments|

US6466906B2|1999-01-06|2002-10-15|Dspc Technologies Ltd.|Noise padding and normalization in dynamic time warping|

US6453291B1|1999-02-04|2002-09-17|Motorola, Inc.|Apparatus and method for voice activity detection in a communication system|

US6324509B1|1999-02-08|2001-11-27|Qualcomm Incorporated|Method and apparatus for accurate endpointing of speech in the presence of noise|

JP2001236085A|2000-02-25|2001-08-31|Matsushita Electric Ind Co Ltd|Sound domain detecting device, stationary noise domain detecting device, nonstationary noise domain detecting device and noise domain detecting device|

JP3588030B2|2000-03-16|2004-11-10|三菱電機株式会社|Voice section determination device and voice section determination method|

US6898566B1|2000-08-16|2005-05-24|Mindspeed Technologies, Inc.|Using signal to noise ratio of a speech signal to adjust thresholds for extracting speech parameters for coding the speech signal|

CN1175398C|2000-11-18|2004-11-10|中兴通讯股份有限公司|Sound activation detection method for identifying speech and music from noise environment|

US7941313B2|2001-05-17|2011-05-10|Qualcomm Incorporated|System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system|

US7203643B2|2001-06-14|2007-04-10|Qualcomm Incorporated|Method and apparatus for transmitting speech activity in distributed voice recognition systems|

US6937980B2|2001-10-02|2005-08-30|Telefonaktiebolaget Lm Ericsson |Speech recognition using microphone antenna array|

JP4281349B2|2001-12-25|2009-06-17|パナソニック株式会社|Telephone equipment|

US7024353B2|2002-08-09|2006-04-04|Motorola, Inc.|Distributed speech recognition with back-end voice activity detection apparatus and method|

US7146315B2|2002-08-30|2006-12-05|Siemens Corporate Research, Inc.|Multichannel voice detection in adverse environments|

US7162420B2|2002-12-10|2007-01-09|Liberato Technologies, Llc|System and method for noise reduction having first and second adaptive filters|

JP4490090B2|2003-12-25|2010-06-23|株式会社エヌ・ティ・ティ・ドコモ|Sound / silence determination device and sound / silence determination method|

CA2454296A1|2003-12-29|2005-06-29|Nokia Corporation|Method and device for speech enhancement in the presence of background noise|

US8340309B2|2004-08-06|2012-12-25|Aliphcom, Inc.|Noise suppressing multi-microphone headset|

CN100369113C|2004-12-31|2008-02-13|中国科学院自动化研究所|Method for adaptively improving speech recognition rate by means of gain|

US8175877B2|2005-02-02|2012-05-08|At&T Intellectual Property Ii, L.P.|Method and apparatus for predicting word accuracy in automatic speech recognition systems|

EP1982324B1|2006-02-10|2014-09-24|Telefonaktiebolaget LM Ericsson |A voice detector and a method for suppressing sub-bands in a voice detector|

US8311814B2|2006-09-19|2012-11-13|Avaya Inc.|Efficient voice activity detector to detect fixed power signals|

CN101197130B|2006-12-07|2011-05-18|华为技术有限公司|Sound activity detecting method and detector thereof|

US7769585B2|2007-04-05|2010-08-03|Avidyne Corporation|System and method of voice activity detection in noisy environments|

CN101320559B|2007-06-07|2011-05-18|华为技术有限公司|Sound activation detection apparatus and method|

US8954324B2|2007-09-28|2015-02-10|Qualcomm Incorporated|Multiple microphone voice activity detector|

KR101335417B1|2008-03-31|2013-12-05|트란소노|Procedure for processing noisy speech signals, and apparatus and program therefor|

US8326620B2|2008-04-30|2012-12-04|Qnx Software Systems Limited|Robust downlink speech and noise detector|

US8768690B2|2008-06-20|2014-07-01|Qualcomm Incorporated|Coding scheme selection for low-bit-rate applications|

WO2010091339A1|2009-02-06|2010-08-12|University Of Ottawa|Method and system for noise reduction for speech enhancement in hearing aid|

JP5337530B2|2009-02-25|2013-11-06|京セラ株式会社|Radio base station and radio communication method|

KR20110001130A|2009-06-29|2011-01-06|삼성전자주식회사|Apparatus and method for encoding and decoding audio signals using weighted linear prediction transform|

CN102044242B|2009-10-15|2012-01-25|华为技术有限公司|Method, device and electronic equipment for voice activation detection|

CN102044243B|2009-10-15|2012-08-29|华为技术有限公司|Method and device for voice activity detection and encoder|

WO2011049516A1|2009-10-19|2011-04-28|Telefonaktiebolaget Lm Ericsson |Detector and method for voice activity detection|

WO2011049515A1|2009-10-19|2011-04-28|Telefonaktiebolaget Lm Ericsson |Method and voice activity detector for a speech encoder|

US8898058B2|2010-10-25|2014-11-25|Qualcomm Incorporated|Systems, methods, and apparatus for voice activity detection|

EP3252771B1|2010-12-24|2019-05-01|Huawei Technologies Co., Ltd.|A method and an apparatus for performing a voice activity detection|

ES2860986T3|2010-12-24|2021-10-05|Huawei Tech Co Ltd|Method and apparatus for adaptively detecting a voice activity in an input audio signal|

WO2012083552A1|2010-12-24|2012-06-28|Huawei Technologies Co., Ltd.|Method and apparatus for voice activity detection|

US9099098B2|2012-01-20|2015-08-04|Qualcomm Incorporated|Voice activity detection in presence of background noise|

JP5875609B2|2012-02-10|2016-03-02|三菱電機株式会社|Noise suppressor|

JP5862349B2|2012-02-16|2016-02-16|株式会社Ｊｖｃケンウッド|Noise reduction device, voice input device, wireless communication device, and noise reduction method|

CN103325380B|2012-03-23|2017-09-12|杜比实验室特许公司|Gain for signal enhancing is post-processed|

US20130282373A1|2012-04-23|2013-10-24|Qualcomm Incorporated|Systems and methods for audio signal processing|

US9524735B2|2014-01-31|2016-12-20|Apple Inc.|Threshold adaptation in two-channel noise estimation and voice activity detection|

CN107293287B|2014-03-12|2021-10-26|华为技术有限公司|Method and apparatus for detecting audio signal|

US9775113B2|2014-12-11|2017-09-26|Mediatek Inc.|Voice wakeup detecting device with digital microphone and associated method|

CN107293287B|2014-03-12|2021-10-26|华为技术有限公司|Method and apparatus for detecting audio signal|

BR112017021239A2|2016-04-29|2018-06-26|Huawei Tech Co Ltd|method for determining voice input, handset, terminal and computational input exception|

CN107040359B|2017-05-08|2021-01-19|海能达通信股份有限公司|Method, device and equipment for carrying channel associated signaling in voice calling process|

CN107393559B|2017-07-14|2021-05-18|深圳永顺智信息科技有限公司|Method and device for checking voice detection result|

CN107393558B|2017-07-14|2020-09-11|深圳永顺智信息科技有限公司|Voice activity detection method and device|

CN107393550B|2017-07-14|2021-03-19|深圳永顺智信息科技有限公司|Voice processing method and device|

CN107393553B|2017-07-14|2020-12-22|深圳永顺智信息科技有限公司|Auditory feature extraction method for voice activity detection|

Legal status:

Priority:

Application number | Filing date | Patent title

CN201410090386.XA|CN104916292B|2014-03-12|2014-03-12|Method and apparatus for detecting audio signals|

PCT/CN2014/092694|WO2015135344A1|2014-03-12|2014-12-01|Method and device for detecting audio signal|
